T-Crowd: Effective Crowdsourcing for Tabular Data
Crowdsourcing employs human workers to solve computer-hard problems, such as
data cleaning, entity resolution, and sentiment analysis. When crowdsourcing
tabular data, e.g., the attribute values of an entity set, a worker's answers
on the different attributes (e.g., the nationality and age of a celebrity star)
are often treated independently. This assumption is not always true and can
lead to suboptimal crowdsourcing performance. In this paper, we present the
T-Crowd system, which takes into consideration the intricate relationships
among tasks, in order to converge faster to their true values. In particular,
T-Crowd integrates each worker's answers on different attributes to effectively
learn his/her trustworthiness and the true data values. The attribute
relationship information is also used to guide task allocation to workers.
Finally, T-Crowd seamlessly supports categorical and continuous attributes,
which are the two main datatypes found in typical databases. Our extensive
experiments on real and synthetic datasets show that T-Crowd outperforms
state-of-the-art methods in terms of truth inference and reducing the cost of
crowdsourcing.
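The joint-inference idea can be sketched as a toy weighted-voting loop in which a single reliability score per worker is pooled across every attribute that worker answered. This is an illustrative simplification, not T-Crowd's actual model: the data, the smoothing rule, and the restriction to categorical attributes are all assumptions made for brevity.

```python
# Toy sketch: one shared reliability per worker, pooled across attributes.
# NOT the T-Crowd algorithm; a simplified weighted-voting illustration.
from collections import defaultdict

def infer_truths(answers, n_rounds=5):
    """answers: list of (worker, item, value) triples for categorical items."""
    reliability = defaultdict(lambda: 1.0)
    truths = {}
    for _ in range(n_rounds):
        # Weighted vote per item using current worker reliabilities.
        votes = defaultdict(lambda: defaultdict(float))
        for w, item, v in answers:
            votes[item][v] += reliability[w]
        truths = {item: max(vs, key=vs.get) for item, vs in votes.items()}
        # Re-estimate reliability as the (smoothed) fraction of a worker's
        # answers matching the current truths, pooled across ALL attributes.
        hits, total = defaultdict(float), defaultdict(int)
        for w, item, v in answers:
            total[w] += 1
            hits[w] += (v == truths[item])
        for w in total:
            reliability[w] = (hits[w] + 1) / (total[w] + 2)
    return truths, dict(reliability)

# Hypothetical answers about two attributes of one celebrity.
answers = [
    ("w1", "celebrity_nationality", "US"),
    ("w2", "celebrity_nationality", "US"),
    ("w3", "celebrity_nationality", "UK"),
    ("w1", "celebrity_age_bucket", "40s"),
    ("w3", "celebrity_age_bucket", "30s"),
]
truths, rel = infer_truths(answers)
```

Because w3 disagrees with the majority on nationality, their pooled reliability drops, which also down-weights their age answer.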
Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
LSM-trees are widely adopted as the storage backend of key-value stores.
However, optimizing the system performance under dynamic workloads has not been
sufficiently studied or evaluated in previous work. To fill the gap, we present
RusKey, a key-value store with the following new features: (1) RusKey is the
first attempt to orchestrate LSM-tree structures online, enabling robust
performance under dynamic workloads; (2) RusKey is the first
study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3)
RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient
transition between different compaction policies -- the bottleneck of dynamic
key-value stores. We justify the superiority of the new design with theoretical
analysis; (4) RusKey requires no prior workload knowledge for system
adjustment, in contrast to state-of-the-art techniques. Experiments show that
RusKey exhibits strong performance robustness in diverse workloads, achieving
up to 4x better end-to-end performance than the RocksDB system under various
settings.
Comment: 25 pages, 13 figures
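The online policy-switching loop can be pictured as a toy epsilon-greedy bandit that learns which compaction policy earns the higher reward under the current workload. The policy names, the reward model, and the bandit formulation below are illustrative assumptions, not RusKey's actual RL design.

```python
# Toy epsilon-greedy bandit choosing a compaction policy online.
# Illustrative only; RusKey's actual RL formulation differs.
import random

POLICIES = ["leveling", "tiering"]

def choose_policy(q, eps, rng):
    """Explore with probability eps, otherwise exploit the best estimate."""
    if rng.random() < eps:
        return rng.choice(POLICIES)
    return max(POLICIES, key=lambda p: q[p])

def run(workload_rewards, steps=200, eps=0.1, seed=0):
    """workload_rewards: policy -> mean reward (e.g., observed throughput)."""
    rng = random.Random(seed)
    q = {p: 0.0 for p in POLICIES}   # running reward estimates
    n = {p: 0 for p in POLICIES}
    for _ in range(steps):
        p = choose_policy(q, eps, rng)
        r = workload_rewards[p] + rng.gauss(0, 0.1)  # noisy observation
        n[p] += 1
        q[p] += (r - q[p]) / n[p]    # incremental mean update
    return q

# Hypothetical write-heavy phase in which tiering yields higher reward.
q = run({"leveling": 0.3, "tiering": 0.8})
```

After a modest number of steps the agent's estimates separate and it exploits the better policy, which is the behavior an online LSM-tree tuner needs.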
Biological Factor Regulatory Neural Network
Genes are fundamental for analyzing biological systems, and many recent works
have proposed deep learning models that utilize gene expression for various
biological tasks. Despite their promising performance, it is hard for deep
neural networks to provide biological insights for humans due to their
black-box nature. Recently, some works integrated biological knowledge with
neural networks to improve the transparency and performance of their models.
However, these methods can only incorporate partial biological knowledge,
leading to suboptimal performance. In this paper, we propose the Biological
Factor Regulatory Neural Network (BFReg-NN), a generic framework to model
relations among biological factors in cell systems. BFReg-NN starts from gene
expression data and is capable of merging most existing biological knowledge
into the model, including the regulatory relations among genes or proteins
(e.g., gene regulatory networks (GRN), protein-protein interaction networks
(PPI)) and the hierarchical relations among genes, proteins and pathways (e.g.,
several genes/proteins are contained in a pathway). Moreover, BFReg-NN also has
the ability to provide new biologically meaningful insights because of its
white-box characteristics. Experimental results on different gene
expression-based tasks verify the superiority of BFReg-NN compared with
baselines. Our case studies also show that the key insights found by BFReg-NN
are consistent with the biological literature.
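The white-box construction can be sketched by masking a layer's weight matrix with a prior knowledge graph, so that a connection exists only where a known regulatory edge exists. The tiny mask, the shapes, and the single-layer setup below are hypothetical stand-ins for BFReg-NN's full architecture.

```python
# Sketch: a layer whose connections are restricted by a (hypothetical)
# gene-to-protein regulatory mask, mimicking the white-box idea.
import numpy as np

def knowledge_masked_layer(x, weights, mask):
    """Zero out weights with no supporting biological edge, then apply ReLU."""
    return np.maximum(0.0, x @ (weights * mask))

rng = np.random.default_rng(0)
n_genes, n_proteins = 4, 3
# mask[i, j] = 1 iff gene i is assumed to regulate protein j (made up).
mask = np.array([[1, 0, 0],
                 [1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=float)
weights = rng.normal(size=(n_genes, n_proteins))
expression = rng.random((2, n_genes))        # 2 samples of gene expression
protein_activity = knowledge_masked_layer(expression, weights, mask)
```

Because every surviving weight corresponds to a known edge, a large learned weight can be read back as evidence about that specific regulatory relation, which is what makes the model inspectable.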
Label Propagation for Graph Label Noise
Label noise is a common challenge in large datasets, as it can significantly
degrade the generalization ability of deep neural networks. Most existing
studies focus on noisy labels in computer vision; however, graph models
encompass both node features and graph topology as input, and become more
susceptible to label noise through message-passing mechanisms. To date, only a
few works have been proposed to tackle label noise on graphs. One major
limitation is that they assume the graph is homophilous and the labels are
smoothly distributed. Nevertheless, real-world graphs may contain varying
degrees of heterophily or even be heterophily-dominated, leading to the
inadequacy of current methods. In this paper, we study graph label noise in the
context of arbitrary heterophily, with the aim of rectifying noisy labels and
assigning labels to previously unlabeled nodes. We begin by conducting two
empirical analyses to explore the impact of graph homophily on graph label
noise. Following these observations, we propose a simple yet efficient algorithm,
denoted as LP4GLN. Specifically, LP4GLN is an iterative algorithm with three
steps: (1) reconstruct the graph to recover the homophily property, (2) utilize
label propagation to rectify the noisy labels, (3) select high-confidence
labels to retain for the next iteration. By iterating these steps, we obtain a
set of correct labels, ultimately achieving high accuracy in the node
classification task. Theoretical analysis is also provided to demonstrate
its remarkable denoising effect. Finally, we conduct experiments on 10
benchmark datasets under varying graph heterophily levels and noise types,
comparing the performance of LP4GLN with 7 typical baselines. Our results
illustrate the superior performance of the proposed LP4GLN.
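The three-step loop can be sketched as follows, with the graph-reconstruction step replaced by a fixed homophilous adjacency for brevity. The propagation rule is standard label propagation; the example graph, labels, and confidence threshold are made up for illustration.

```python
# Sketch of one LP4GLN-style iteration: propagate labels, then keep only
# high-confidence predictions. Graph reconstruction is omitted here.
import numpy as np

def propagate(adj, labels_onehot, alpha=0.8, n_iter=20):
    """Standard label propagation: F <- alpha * S @ F + (1 - alpha) * Y."""
    deg = adj.sum(axis=1)
    s = adj / np.sqrt(np.outer(deg, deg))    # symmetric normalization
    f = labels_onehot.copy()
    for _ in range(n_iter):
        f = alpha * (s @ f) + (1 - alpha) * labels_onehot
    return f

def select_confident(scores, threshold=0.3):
    """Return (node, label) pairs whose top score clears the threshold."""
    top = scores.max(axis=1)
    return [(i, int(scores[i].argmax()))
            for i in range(len(scores)) if top[i] > threshold]

# Two disconnected homophilous triangles: {0,1,2} and {3,4,5}.
adj = np.array([[0, 1, 1, 0, 0, 0],
                [1, 0, 1, 0, 0, 0],
                [1, 1, 0, 0, 0, 0],
                [0, 0, 0, 0, 1, 1],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)
y = np.zeros((6, 2))
y[0, 0] = y[1, 0] = 1     # triangle A labeled class 0
y[2, 1] = 1               # noisy: node 2 wrongly labeled class 1
y[4, 1] = y[5, 1] = 1     # triangle B labeled class 1; node 3 unlabeled
scores = propagate(adj, y)
confident = dict(select_confident(scores))
```

In this toy run the clean neighbors of node 2 outvote its noisy label (rectification), while the unlabeled node 3 inherits its cluster's label, mirroring the two goals stated in the abstract.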
Truth Inference in Crowdsourcing: Is the Problem Solved?
Crowdsourcing has emerged as a novel problem-solving paradigm, which facilitates addressing problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers, so a redundancy-based method is widely employed: each task is first assigned to multiple workers, and the correct answer (called the truth) for the task is then inferred from the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database and data mining communities have independently studied this problem and proposed various algorithms. However, these algorithms have not been compared extensively under the same framework, and it is hard for practitioners to select appropriate algorithms. To alleviate this problem, we provide a detailed survey of 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all code and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and that no algorithm outperforms the others consistently. We believe that the truth inference problem is not fully solved; we identify the limitations of existing algorithms and point out promising research directions.
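The simplest member of the algorithm family surveyed here is majority voting, which ignores worker quality entirely and takes the most frequent answer per task. A minimal sketch, with hypothetical tasks and answers:

```python
# Majority voting: the standard truth-inference baseline. Each task's
# truth is simply the answer given by the most workers.
from collections import Counter, defaultdict

def majority_vote(answers):
    """answers: list of (worker, task, label) triples -> {task: truth}."""
    by_task = defaultdict(list)
    for _, task, label in answers:
        by_task[task].append(label)
    return {task: Counter(labels).most_common(1)[0][0]
            for task, labels in by_task.items()}

# Hypothetical sentiment and entity-resolution tasks.
answers = [
    ("w1", "t1", "positive"), ("w2", "t1", "positive"), ("w3", "t1", "negative"),
    ("w1", "t2", "same"), ("w2", "t2", "different"), ("w3", "t2", "same"),
]
truths = majority_vote(answers)
```

The more sophisticated algorithms the survey evaluates differ mainly in how they replace this uniform vote with learned worker-quality weights.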
Learning Decomposed Spatial Relations for Multi-Variate Time-Series Modeling
Modeling multi-variate time-series (MVTS) data is a long-standing research subject and has found wide applications. Recently, there has been a surge of interest in modeling spatial relations between variables as graphs, i.e., first learning one static graph for each dataset and then exploiting the graph structure via graph neural networks. However, as spatial relations may differ substantially across samples, building one static graph for all the samples inherently limits flexibility and severely degrades performance in practice. To address this issue, we propose a framework for fine-grained modeling and utilization of spatial correlation between variables. By analyzing the statistical properties of real-world datasets, a universal decomposition of spatial correlation graphs is first identified. Specifically, the hidden spatial relations can be decomposed into a prior part, which applies across all the samples, and a dynamic part, which varies between samples; building different graphs is necessary to model these relations. To better coordinate the learning of the two relational graphs, we propose a min-max learning paradigm that not only regulates the common part of different dynamic graphs but also guarantees spatial distinguishability among samples. The experimental results show that our proposed model outperforms the state-of-the-art baseline methods on both time-series forecasting and time-series point prediction tasks.
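The prior-plus-dynamic decomposition can be sketched as mixing one shared adjacency with a per-sample adjacency built from that sample's own features. The similarity rule and the fixed mixing weight below are illustrative stand-ins for the paper's learned min-max formulation.

```python
# Sketch: per-sample adjacency = shared prior graph + sample-specific
# dynamic graph. The similarity-based dynamic graph is illustrative only.
import numpy as np

def dynamic_graph(sample, temperature=1.0):
    """Row-normalized adjacency from pairwise feature similarity."""
    sims = sample @ sample.T / temperature
    e = np.exp(sims - sims.max(axis=1, keepdims=True))   # stable softmax
    return e / e.sum(axis=1, keepdims=True)

def combined_adjacency(prior, sample, lam=0.5):
    """Mix the shared prior graph with the sample-specific dynamic graph."""
    return lam * prior + (1 - lam) * dynamic_graph(sample)

rng = np.random.default_rng(0)
n_vars, seq_len = 4, 16
prior = np.full((n_vars, n_vars), 1.0 / n_vars)  # shared across all samples
sample_a = rng.normal(size=(n_vars, seq_len))
sample_b = rng.normal(size=(n_vars, seq_len))
adj_a = combined_adjacency(prior, sample_a)
adj_b = combined_adjacency(prior, sample_b)
```

Two different samples yield two different adjacencies while sharing the same prior component, which is the flexibility that a single static graph cannot provide.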
A predictive method to determine incomplete electronic medical records
© 2018 Association for Computing Machinery. This paper utilizes predictive models to detect missing electronic medical records (EMRs) at general practice offices. Prior research has addressed the missing values problem in EMRs used for secondary analysis; however, health care providers have overlooked the problem of missing records in the EMRs that store patients' medical visit information. Our study provides a technique to predict the number of EMR entries for each practice based on its past data records. If the number of EMR entries is less than predicted, the method warns of potentially missing records at the 95% confidence level. The study uses seven years of EMRs from 14 general practice offices to train the predictive model, which predicts EMR data entries and accordingly identifies missing EMRs for the following year. We compared the model's predictions against the actual visits reflected in de-identified billing data. The study found that the auto-correlation method improves the performance of identifying missing records by detecting the prediction period. In addition, artificial neural networks and support vector machines perform better than other predictive methods, depending on whether the analysis aims at detecting missing EMRs or at identifying complete EMRs with no missing records. The results suggest that clinicians and medical professionals should be mindful of potentially missing EMRs prior to any secondary analysis.
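The alerting rule described above can be sketched with a simple linear-trend predictor standing in for the paper's neural-network and SVM models. The monthly counts and the residual-based 95% bound are illustrative assumptions, not the study's actual data or models.

```python
# Sketch: predict next period's EMR entry count and flag a practice when
# the observed count falls below a 95% lower bound. Illustrative only.
import numpy as np

def fit_trend(counts):
    """Least-squares linear trend over past periods."""
    t = np.arange(len(counts))
    slope, intercept = np.polyfit(t, counts, 1)
    residual_sd = np.std(counts - (slope * t + intercept), ddof=2)
    return slope, intercept, residual_sd

def flag_missing(counts, observed_next):
    """Flag if the next observed count is below the 95% lower bound."""
    slope, intercept, sd = fit_trend(counts)
    predicted = slope * len(counts) + intercept
    lower = predicted - 1.96 * sd
    return observed_next < lower, predicted

history = [120, 118, 125, 123, 130, 128, 133]   # hypothetical monthly counts
suspicious, predicted = flag_missing(history, observed_next=90)
ok, _ = flag_missing(history, observed_next=132)
```

A sharp shortfall against the trend triggers the warning, while a count consistent with the trend does not, matching the alerting behavior the abstract describes.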